Abstract:Auto-regressive video generation enables long video synthesis by iteratively conditioning each new batch of frames on previously generated content. However, recent work has shown that such pipelines suffer from severe temporal drift, where errors accumulate and amplify over long horizons. We hypothesize that this drift does not primarily stem from insufficient model capacity, but rather from inference-time error propagation. Specifically, we contend that drift arises from the uncontrolled reuse of corrupted latent conditioning tokens during auto-regressive inference. To correct this accumulation of errors, we propose a simple inference-time method that mitigates temporal drift by identifying and removing unstable latent tokens before they are reused for conditioning. For this purpose, we define unstable tokens as latent tokens whose representations deviate significantly from those of the previously generated batch, indicating potential corruption or semantic drift. By explicitly removing corrupted latent tokens from the auto-regressive context, rather than modifying entire spatial regions or model parameters, our method prevents unreliable latent information from influencing future generation steps. As a result, it significantly improves long-horizon temporal consistency without modifying the model architecture or training procedure, and without leaving the latent space.
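Below is a minimal sketch of how deviation-based token filtering could look. The function name, the cosine-distance criterion, and the threshold value are illustrative assumptions for exposition, not the paper's exact procedure.

```python
import torch

def filter_unstable_tokens(prev_latents: torch.Tensor,
                           new_latents: torch.Tensor,
                           threshold: float = 0.35) -> torch.Tensor:
    """Flag latent tokens whose representation drifts too far from the previous
    batch, so that only stable tokens are reused as conditioning.

    prev_latents, new_latents: (num_tokens, dim) latents for the previous and
    the freshly generated frame batch, aligned token-by-token.
    Returns a boolean mask over tokens considered stable.
    """
    # Cosine similarity between corresponding tokens of consecutive batches.
    sim = torch.nn.functional.cosine_similarity(prev_latents, new_latents, dim=-1)
    return (1.0 - sim) < threshold            # small deviation -> keep the token

# Toy usage: 16 tokens of dimension 64, with two artificially corrupted tokens.
prev = torch.randn(16, 64)
new = prev + 0.05 * torch.randn(16, 64)       # mostly consistent continuation
new[3] = torch.randn(64)                      # simulate corrupted tokens
new[11] = torch.randn(64)
mask = filter_unstable_tokens(prev, new)
conditioning = new[mask]                      # only stable tokens enter the context
print(f"kept {mask.sum().item()} / {mask.numel()} tokens")
```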
Abstract:Attention matrices are fundamental to transformer research, supporting a broad range of applications including interpretability, visualization, manipulation, and distillation. Yet, most existing analyses focus on individual attention heads or layers, failing to account for the model's global behavior. While prior efforts have extended attention formulations across multiple heads via averaging and matrix multiplications or incorporated components such as normalization and FFNs, a unified and complete representation that encapsulates all transformer blocks is still lacking. We address this gap by introducing TensorLens, a novel formulation that captures the entire transformer as a single, input-dependent linear operator expressed through a high-order attention-interaction tensor. This tensor jointly encodes attention, FFNs, activations, normalizations, and residual connections, offering a theoretically coherent and expressive linear representation of the model's computation. TensorLens is theoretically grounded, and our empirical validation shows that it yields richer representations than previous attention-aggregation methods. Our experiments demonstrate that the attention tensor can serve as a powerful foundation for developing tools aimed at interpretability and model understanding. Our code is attached as supplementary material.
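The following toy example is not the TensorLens construction itself, but a numerical sketch of the underlying idea: once the softmax attention weights and ReLU gating pattern are frozen for a given input, a simplified transformer block (no norms or biases) acts as a single linear operator on that input. All matrices, sizes, and the simplifications are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 4, 8, 16                                    # tokens, model dim, FFN width
X = rng.standard_normal((n, d))
Wq, Wk, Wv, Wo = (0.3 * rng.standard_normal((d, d)) for _ in range(4))
W1, W2 = 0.3 * rng.standard_normal((d, h)), 0.3 * rng.standard_normal((h, d))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Forward pass of a simplified block: attention + FFN, both with residuals.
A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))       # (n, n) attention weights
H = X + A @ X @ (Wv @ Wo)
Y = H + np.maximum(H @ W1, 0.0) @ W2

# Freeze the input-dependent pieces (A and the ReLU gates): the whole block
# becomes one matrix acting on the row-major vectorization of X.
M_attn = np.eye(n * d) + np.kron(A, (Wv @ Wo).T)      # vec(A X B) = (A ⊗ Bᵀ) vec(X)
gates = (H @ W1 > 0).astype(float)                    # (n, h) ReLU activation pattern
M_ffn = np.eye(n * d)
for t in range(n):
    Mt = W1 @ np.diag(gates[t]) @ W2                  # per-token linearized FFN
    M_ffn[t * d:(t + 1) * d, t * d:(t + 1) * d] += Mt.T
M_block = M_ffn @ M_attn                              # the block as one operator
assert np.allclose(M_block @ X.flatten(), Y.flatten())
print("single input-dependent linear operator reproduces the block output")
```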




Abstract:Large Language Models (LLMs) are vulnerable to adversarial attacks that bypass safety guidelines and generate harmful content. Mitigating these vulnerabilities requires defense mechanisms that are both robust and computationally efficient. However, existing approaches either incur high computational costs or rely on lightweight defenses that can be easily circumvented, rendering them impractical for real-world LLM-based systems. In this work, we introduce the AlignTree defense, which enhances model alignment while maintaining minimal computational overhead. AlignTree monitors LLM activations during generation and detects misaligned behavior using an efficient random forest classifier. This classifier operates on two signals: (i) the refusal direction -- a linear representation that activates on misaligned prompts, and (ii) an SVM-based signal that captures non-linear features associated with harmful content. Unlike previous methods, AlignTree does not require additional prompts or auxiliary guard models. Through extensive experiments, we demonstrate the efficiency and robustness of AlignTree across multiple LLMs and benchmarks.
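As an illustration of the two-signal idea, the sketch below trains on synthetic stand-in activations: a "refusal direction" projection (difference of class means here) plus an SVM decision value feed a small random forest. The feature construction and hyperparameters are assumptions for exposition, not the AlignTree implementation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden activations of benign vs. misaligned prompts.
d = 64
benign = rng.standard_normal((500, d))
harmful = rng.standard_normal((500, d)) + 0.8        # shifted cluster
acts = np.vstack([benign, harmful])
labels = np.concatenate([np.zeros(500), np.ones(500)])
X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.3, random_state=0)

# Signal (i): a "refusal direction" estimated as the difference of class means.
refusal_dir = X_tr[y_tr == 1].mean(0) - X_tr[y_tr == 0].mean(0)
refusal_dir /= np.linalg.norm(refusal_dir)

# Signal (ii): an RBF-kernel SVM captures non-linear structure in the activations
# and is summarized by its decision value.
svm = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)

def two_signals(X):
    return np.column_stack([X @ refusal_dir, svm.decision_function(X)])

# A random forest combines the two scalar signals per generation step.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(two_signals(X_tr), y_tr)
print("held-out accuracy:", forest.score(two_signals(X_te), y_te))
```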




Abstract:Vision-language models (VLMs) frequently produce hallucinations in the form of descriptions of objects, attributes, or relations that do not exist in the image due to over-reliance on language priors and imprecise cross-modal grounding. We introduce Spectral Representation Filtering (SRF), a lightweight, training-free method to suppress such hallucinations by analyzing and correcting the covariance structure of the model's representations. SRF identifies low-rank hallucination modes through eigendecomposition of the covariance of the differences between features collected for truthful and hallucinatory captions, revealing structured biases in the feature space. A soft spectral filter then attenuates these modes in the feed-forward projection weights of deeper VLM layers, equalizing feature variance while preserving semantic fidelity. Unlike decoding or retraining-based approaches, SRF operates entirely post-hoc, incurs zero inference overhead, and requires no architectural modifications. Across three families of VLMs (LLaVA-1.5, MiniGPT-4, and mPLUG-Owl2), SRF consistently reduces hallucination rates on MSCOCO, POPE-VQA, and other visual-task benchmarks, achieving state-of-the-art faithfulness without degrading caption quality.
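A numpy sketch of the spectral-filtering step as described: eigendecompose the covariance of truthful-minus-hallucinatory feature differences and softly attenuate the dominant modes in a projection weight. The rank, attenuation strength, and choice of layer below are illustrative assumptions.

```python
import numpy as np

def spectral_filter(W_proj: np.ndarray,
                    feats_truthful: np.ndarray,
                    feats_hallucinated: np.ndarray,
                    rank: int = 8,
                    strength: float = 0.7) -> np.ndarray:
    """Attenuate low-rank 'hallucination modes' in a feed-forward projection.

    W_proj:   (d, d_out) weight of a deep FFN projection layer (y = x @ W_proj).
    feats_*:  (n, d) features collected for paired truthful / hallucinatory captions.
    Returns a filtered copy of W_proj.
    """
    diffs = feats_hallucinated - feats_truthful          # (n, d)
    cov = np.cov(diffs, rowvar=False)                    # (d, d)
    eigvals, eigvecs = np.linalg.eigh(cov)               # ascending eigenvalues
    top = eigvecs[:, -rank:]                             # dominant drift modes
    # Soft filter: shrink (rather than delete) the top modes in the layer's
    # input space, leaving the orthogonal complement untouched.
    P = np.eye(cov.shape[0]) - strength * top @ top.T
    return P @ W_proj

# Toy usage with random stand-ins for collected features and a layer weight.
rng = np.random.default_rng(0)
d, n = 128, 400
truthful = rng.standard_normal((n, d))
hallucinated = truthful + 0.6 * rng.standard_normal((1, d))  # structured low-rank bias
W_filtered = spectral_filter(rng.standard_normal((d, d)), truthful, hallucinated)
```

Because the projection matrix P is symmetric, applying it to the weights is equivalent to filtering the incoming features, which is what makes the correction post-hoc and free at inference time.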




Abstract:We present a novel approach to neural code generation that incorporates real-time execution signals into the language model generation process. While large language models (LLMs) have demonstrated impressive code generation capabilities, they typically do not utilize execution feedback during inference, a critical signal that human programmers regularly leverage. Our method, Execution-Guided Classifier-Free Guidance (EG-CFG), dynamically incorporates execution signals as the model generates code, providing line-by-line feedback that guides the generation process toward executable solutions. EG-CFG employs a multi-stage process: first, we conduct beam search to sample candidate program completions for each line; second, we extract execution signals by executing these candidates against test cases; and finally, we incorporate these signals into the prompt during generation. By maintaining consistent signals across tokens within the same line and refreshing signals at line boundaries, our approach provides coherent guidance while preserving syntactic structure. Moreover, the method naturally supports native parallelism at the task level in which multiple agents operate in parallel, exploring diverse reasoning paths and collectively generating a broad set of candidate solutions. Our experiments across diverse coding tasks demonstrate that EG-CFG significantly improves code generation performance compared to standard approaches, achieving state-of-the-art results across various levels of complexity, from foundational problems to challenging competitive programming tasks. Our code is available at: https://github.com/boazlavon/eg_cfg
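The snippet below sketches only the execution-signal side of the loop: run a candidate completion against the test cases and fold the outcome into a textual signal that can be appended to the prompt. Candidate sampling is omitted, and the feedback format is an illustrative assumption rather than the repository's exact implementation.

```python
import traceback

def execution_signal(candidate_src: str, test_cases: list[str]) -> str:
    """Execute a candidate program against the test cases and summarize the
    outcome as a textual signal for prompt conditioning."""
    lines = []
    for test in test_cases:
        env: dict = {}
        try:
            exec(candidate_src, env)          # define the candidate function(s)
            exec(test, env)                   # run one assertion-style test
            lines.append(f"PASS {test}")
        except Exception:
            err = traceback.format_exc().strip().splitlines()[-1]
            lines.append(f"FAIL {test} -> {err}")
    return "\n".join(lines)

# Toy usage: one (deliberately buggy) candidate and two tests.
candidate = "def add(a, b):\n    return a - b\n"
tests = ["assert add(2, 3) == 5", "assert add(0, 0) == 0"]
prompt_suffix = "\n# Execution feedback for this line's candidates:\n" + \
    execution_signal(candidate, tests) + "\n"
print(prompt_suffix)
```

The guided distribution then interpolates the feedback-conditioned and unconditional logits in the usual classifier-free-guidance fashion, with the guidance weight and refresh-at-line-boundary schedule as described in the abstract.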




Abstract:Recently, techniques such as explicit structured reasoning have demonstrated strong test-time scaling behavior by enforcing a separation between the model's internal "thinking" process and the final response. A key factor influencing answer quality in this setting is the length of the thinking stage. When the reasoning is too short, the model may fail to capture the complexity of the task. Conversely, when it is too long, the model may overthink, leading to unnecessary computation and degraded performance. This paper explores and exploits the underlying mechanisms by which LLMs understand and regulate the length of their reasoning during explicit thought processes. First, we show that LLMs encode their progress through the reasoning process and introduce an interactive progress bar visualization, which is then used to reveal insights on the model's planning dynamics. Second, we manipulate the internal progress encoding during inference to reduce unnecessary steps and generate a more concise and decisive chain of thought. Our empirical results demonstrate that this "overclocking" method mitigates overthinking, improves answer accuracy, and reduces inference latency. Our code is publicly available.
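A schematic sketch of the idea on synthetic hidden states: fit a linear probe that predicts relative progress through the thinking span, then shift states along that direction at inference to "overclock" the reasoning. The probe target, the synthetic data, and the step size are assumptions, not the paper's released setup.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
d, n_traces, max_len = 256, 200, 60

# Synthetic hidden states whose position in the thinking span is (noisily)
# encoded along a hidden "progress" direction.
true_dir = rng.standard_normal(d)
true_dir /= np.linalg.norm(true_dir)
H, progress = [], []
for _ in range(n_traces):
    L = rng.integers(20, max_len)
    for t in range(L):
        p = t / (L - 1)
        H.append(p * true_dir + 0.3 * rng.standard_normal(d))
        progress.append(p)
H, progress = np.array(H), np.array(progress)

# Linear probe: recovers a direction along which progress is encoded.
probe = Ridge(alpha=1.0).fit(H, progress)
progress_dir = probe.coef_ / np.linalg.norm(probe.coef_)

def overclock(hidden_state: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Nudge a hidden state 'forward' along the progress direction so the model
    behaves as if it were further into its reasoning (illustrative only)."""
    return hidden_state + alpha * progress_dir

h_mid = H[10]
print("probe estimate before/after:",
      probe.predict(h_mid[None])[0], probe.predict(overclock(h_mid)[None])[0])
```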
Abstract:In text-to-image models, consistent character generation is the task of achieving text alignment while maintaining the subject's appearance across different prompts. However, since style and appearance are often entangled, existing methods struggle to preserve consistent subject characteristics while adhering to varying style prompts. Current approaches for consistent text-to-image generation typically rely on large-scale fine-tuning on curated image sets or per-subject optimization, which either fail to generalize across prompts or do not align well with textual descriptions. Meanwhile, training-free methods often fail to maintain subject consistency across different styles. In this work, we introduce a training-free method that achieves both style alignment and subject consistency. The attention matrices are manipulated such that Queries and Keys are obtained from the anchor image(s) that are used to define the subject, while the Values are imported from a parallel copy that is not subject-anchored. Additionally, cross-image components are added to the self-attention mechanism by expanding the Key and Value matrices. To do so without shifting away from the target style, we align the statistics of the Value matrices. As is demonstrated in a comprehensive battery of qualitative and quantitative experiments, our method effectively decouples style from subject appearance and enables faithful generation of text-aligned images with consistent characters across diverse styles.
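A single-head sketch of the described attention surgery: Queries/Keys come from the subject-anchored branch, Values from the non-anchored parallel branch, cross-image Keys/Values are appended, and Value statistics are matched to avoid style drift. Tensor shapes and the mean/std matching rule are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def match_value_stats(V_src: torch.Tensor, V_ref: torch.Tensor) -> torch.Tensor:
    """Align the per-channel mean/std of the source Values to the reference
    Values so that injected subject information does not shift the style."""
    mu_s, std_s = V_src.mean(0, keepdim=True), V_src.std(0, keepdim=True) + 1e-6
    mu_r, std_r = V_ref.mean(0, keepdim=True), V_ref.std(0, keepdim=True) + 1e-6
    return (V_src - mu_s) / std_s * std_r + mu_r

def anchored_attention(Q_anchor, K_anchor, V_plain, K_cross, V_cross):
    """Q/K from the subject-anchored branch, V from the non-anchored branch,
    with cross-image Keys/Values appended (single head, shape (tokens, dim))."""
    d = Q_anchor.shape[-1]
    K = torch.cat([K_anchor, K_cross], dim=0)                    # extended Keys
    V = torch.cat([V_plain, match_value_stats(V_cross, V_plain)], dim=0)
    attn = F.softmax(Q_anchor @ K.T / d ** 0.5, dim=-1)
    return attn @ V

# Toy shapes: 64 image tokens, 40 cross-image tokens, feature dim 32.
t, tc, dim = 64, 40, 32
out = anchored_attention(torch.randn(t, dim), torch.randn(t, dim),
                         torch.randn(t, dim), torch.randn(tc, dim),
                         torch.randn(tc, dim))
print(out.shape)  # torch.Size([64, 32])
```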
Abstract:Sign languages are dynamic visual languages that involve hand gestures, in combination with non-manual elements such as facial expressions. While video recordings of sign language are commonly used for education and documentation, the dynamic nature of signs can make it challenging to study them in detail, especially for new learners and educators. This work aims to convert sign language video footage into static illustrations, which serve as an additional educational resource to complement video content. This process is usually done by an artist, and is therefore quite costly. We propose a method that illustrates sign language videos by leveraging generative models' ability to understand both the semantic and geometric aspects of images. Our approach focuses on transferring a sketch-like illustration style to video footage of sign language, combining the start and end frames of a sign into a single illustration, and using arrows to highlight the hand's direction and motion. While many style transfer methods address domain adaptation at varying levels of abstraction, applying a sketch-like style to sign languages, especially for hand gestures and facial expressions, poses a significant challenge. To tackle this, we intervene in the denoising process of a diffusion model, injecting style as keys and values into high-resolution attention layers, and fusing geometric information from the image and edges as queries. For the final illustration, we use the attention mechanism to combine the attention weights from both the start and end illustrations, resulting in a soft combination. Our method offers a cost-effective solution for generating sign language illustrations at inference time, addressing the lack of such resources in educational materials.
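A minimal single-head sketch of the fusion step: image/edge features act as Queries, style features as shared Keys/Values, and the attention weights obtained for the start and end frames are softly blended into one illustration. Shapes, names, and the blend weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fused_illustration_attention(Q_start, Q_end, K_style, V_style, w: float = 0.5):
    """Blend the attention maps computed from the start- and end-frame queries
    against shared style Keys/Values into a single soft combination."""
    d = K_style.shape[-1]
    attn_start = F.softmax(Q_start @ K_style.T / d ** 0.5, dim=-1)
    attn_end = F.softmax(Q_end @ K_style.T / d ** 0.5, dim=-1)
    attn = w * attn_start + (1.0 - w) * attn_end      # soft combination of weights
    return attn @ V_style

# Toy usage: 256 spatial tokens, 77 style tokens, feature dim 64.
Qs, Qe = torch.randn(256, 64), torch.randn(256, 64)
Ks, Vs = torch.randn(77, 64), torch.randn(77, 64)
print(fused_illustration_attention(Qs, Qe, Ks, Vs).shape)  # torch.Size([256, 64])
```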
Abstract:Deep neural networks (DNNs) have demonstrated remarkable success, yet their wide adoption is often hindered by their opaque decision-making. To address this, attribution methods have been proposed to assign relevance values to each part of the input. However, different methods often produce entirely different relevance maps, necessitating the development of standardized metrics to evaluate them. Typically, such evaluation is performed through perturbation, wherein high- or low-relevance regions of the input image are manipulated to examine the change in prediction. In this work, we introduce a novel approach, which harnesses image generation models to perform targeted perturbation. Specifically, we focus on inpainting only the high-relevance pixels of an input image to modify the model's predictions while preserving image fidelity. This is in contrast to existing approaches, which often produce out-of-distribution modifications, leading to unreliable results. Through extensive experiments, we demonstrate the effectiveness of our approach in generating meaningful rankings across a wide range of models and attribution methods. Crucially, we establish that the ranking produced by our metric exhibits significantly higher correlation with human preferences compared to existing approaches, underscoring its potential for enhancing interpretability in DNNs.
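The sketch below shows the shape of the evaluation loop: mask the top-relevance pixels, fill them in, and measure how much the model's confidence drops. Here a classical OpenCV inpainter and a tiny random classifier stand in for the generative inpainting model and the network under test; the quantile threshold and scoring rule are illustrative assumptions.

```python
import cv2
import numpy as np
import torch
import torch.nn as nn

def inpaint_top_relevance(image_u8: np.ndarray, relevance: np.ndarray, q: float = 0.9):
    """Inpaint only the pixels whose relevance exceeds the q-th quantile."""
    mask = (relevance >= np.quantile(relevance, q)).astype(np.uint8) * 255
    # Classical inpainting as a stand-in for a generative inpainting model.
    return cv2.inpaint(image_u8, mask, 3, cv2.INPAINT_TELEA)

def prediction_drop(model: nn.Module, image_u8: np.ndarray, relevance: np.ndarray) -> float:
    """Score an attribution map by the confidence drop after targeted inpainting."""
    def to_tensor(img):
        return torch.from_numpy(img).float().permute(2, 0, 1)[None] / 255.0
    model.eval()
    with torch.no_grad():
        p_orig = model(to_tensor(image_u8)).softmax(-1)
        cls = p_orig.argmax(-1)
        p_pert = model(to_tensor(inpaint_top_relevance(image_u8, relevance))).softmax(-1)
    return (p_orig[0, cls] - p_pert[0, cls]).item()

# Toy usage with a random image, a random relevance map, and a linear classifier.
img = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
rel = np.random.rand(64, 64)
net = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 10))
print("confidence drop:", prediction_drop(net, img, rel))
```

Attribution methods are then ranked by how large a confidence drop their high-relevance regions induce while the perturbed image remains in-distribution.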
Abstract:In the domain of audio-visual event perception, which focuses on the temporal localization and classification of events across distinct modalities (audio and visual), existing approaches are constrained by the vocabulary available in their training data. This limitation significantly impedes their capacity to generalize to novel, unseen event categories. Furthermore, the annotation process for this task is labor-intensive, requiring extensive manual labeling across modalities and temporal segments, limiting the scalability of current methods. Current state-of-the-art models ignore the shifts in event distributions over time, reducing their ability to adjust to changing video dynamics. Additionally, previous methods rely on late fusion to combine audio and visual information. While straightforward, this approach results in a significant loss of multimodal interactions. To address these challenges, we propose Audio-Visual Adaptive Video Analysis ($\text{AV}^2\text{A}$), a model-agnostic approach that requires no further training and integrates a score-level fusion technique to retain richer multimodal interactions. $\text{AV}^2\text{A}$ also includes a within-video label shift algorithm, leveraging input video data and predictions from prior frames to dynamically adjust event distributions for subsequent frames. Moreover, we present the first training-free, open-vocabulary baseline for audio-visual event perception, demonstrating that $\text{AV}^2\text{A}$ achieves substantial improvements over naive training-free baselines. We demonstrate the effectiveness of $\text{AV}^2\text{A}$ on both zero-shot and weakly-supervised state-of-the-art methods, achieving notable improvements in performance metrics over existing approaches.
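A small sketch of the two components the abstract names: score-level fusion of per-class audio and visual scores, and a within-video label-shift correction that reweights later frames by a running estimate of the event distribution. The geometric-mean fusion rule and the momentum update are plausible instantiations, not the paper's exact algorithm.

```python
import numpy as np

def fuse_scores(audio_probs: np.ndarray, visual_probs: np.ndarray, w: float = 0.5):
    """Score-level fusion: weighted geometric mean of per-class probabilities."""
    fused = (audio_probs ** w) * (visual_probs ** (1.0 - w))
    return fused / fused.sum(-1, keepdims=True)

def within_video_label_shift(frame_probs: np.ndarray, momentum: float = 0.9):
    """Reweight each frame's class probabilities by a running prior estimated
    from earlier frames, adapting to shifting event distributions over time."""
    n_classes = frame_probs.shape[1]
    prior = np.full(n_classes, 1.0 / n_classes)
    adjusted = []
    for p in frame_probs:
        q = p * prior
        q /= q.sum()
        adjusted.append(q)
        prior = momentum * prior + (1.0 - momentum) * q   # update running prior
    return np.array(adjusted)

# Toy usage: 10 frames, 5 candidate event classes.
rng = np.random.default_rng(0)
audio = rng.dirichlet(np.ones(5), size=10)
visual = rng.dirichlet(np.ones(5), size=10)
print(within_video_label_shift(fuse_scores(audio, visual)).shape)  # (10, 5)
```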